Airlines On-time Performance Data

by Suresh K

Preliminary Wrangling

The US Department of Transport Bureau of Statistics publishes Flight on-time monthly data the dataset primarly show the timeliness of flights, their origin and destimation details. Further details about the dataset can be found at https://www.transtats.bts.gov/Fields.asp The data is avaiable at https://www.transtats.bts.gov/DL_SelectFields.asp for public to download.

For this Analysis let us cosider latest data available, that is first 2 quarters of 2019. Jan to Jun 2019 data has been downloaded in 6 seperate comma seperated file.

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
In [3]:
df_jan = pd.read_csv('T_ONTIME_REPORTING_Jan_2019.csv',
                     engine='c')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3057: DtypeWarning: Columns (48) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [4]:
print(df_jan.describe(include='all'))
            YEAR   QUARTER     MONTH   DAY_OF_MONTH    DAY_OF_WEEK  \
count   583985.0  583985.0  583985.0  583985.000000  583985.000000   
unique       NaN       NaN       NaN            NaN            NaN   
top          NaN       NaN       NaN            NaN            NaN   
freq         NaN       NaN       NaN            NaN            NaN   
mean      2019.0       1.0       1.0      15.960088       3.835626   
std          0.0       0.0       0.0       8.987942       1.921899   
min       2019.0       1.0       1.0       1.000000       1.000000   
25%       2019.0       1.0       1.0       8.000000       2.000000   
50%       2019.0       1.0       1.0      16.000000       4.000000   
75%       2019.0       1.0       1.0      24.000000       5.000000   
max       2019.0       1.0       1.0      31.000000       7.000000   

           FL_DATE OP_UNIQUE_CARRIER  OP_CARRIER_AIRLINE_ID OP_CARRIER  \
count       583985            583985          583985.000000     583985   
unique          31                17                    NaN         17   
top     2019-01-02                WN                    NaN         WN   
freq         20384            111312                    NaN     111312   
mean           NaN               NaN           19983.213168        NaN   
std            NaN               NaN             377.724638        NaN   
min            NaN               NaN           19393.000000        NaN   
25%            NaN               NaN           19790.000000        NaN   
50%            NaN               NaN           19977.000000        NaN   
75%            NaN               NaN           20368.000000        NaN   
max            NaN               NaN           20452.000000        NaN   

       TAIL_NUM  ...  DIV_ARR_DELAY  DIV_DISTANCE  DIV1_AIRPORT  \
count    581442  ...     961.000000   1296.000000          1406   
unique     5446  ...            NaN           NaN           172   
top      N488HA  ...            NaN           NaN           DEN   
freq        361  ...            NaN           NaN            83   
mean        NaN  ...     303.812695     62.152006           NaN   
std         NaN  ...     276.314032    195.236123           NaN   
min         NaN  ...      24.000000      0.000000           NaN   
25%         NaN  ...     134.000000      0.000000           NaN   
50%         NaN  ...     192.000000      0.000000           NaN   
75%         NaN  ...     333.000000     30.000000           NaN   
max         NaN  ...    1542.000000   2586.000000           NaN   

        DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_WHEELS_ON DIV1_TOTAL_GTIME  \
count       1406.000000        1.406000e+03    1406.000000      1406.000000   
unique              NaN                 NaN            NaN              NaN   
top                 NaN                 NaN            NaN              NaN   
freq                NaN                 NaN            NaN              NaN   
mean       12777.906117        1.277794e+06    1505.906828        30.332148   
std         1616.155962        1.616155e+05     540.668597        27.893773   
min        10135.000000        1.013505e+06       1.000000         1.000000   
25%        11292.000000        1.129202e+06    1136.000000        11.000000   
50%        12854.000000        1.285402e+06    1519.000000        22.000000   
75%        14107.000000        1.410702e+06    1945.750000        40.000000   
max        16042.000000        1.604205e+06    2400.000000       189.000000   

        DIV1_LONGEST_GTIME DIV1_WHEELS_OFF  DIV1_TAIL_NUM  
count          1406.000000      968.000000            968  
unique                 NaN             NaN            871  
top                    NaN             NaN         N442SW  
freq                   NaN             NaN              4  
mean             24.641536     1441.678719            NaN  
std              24.904763      568.613560            NaN  
min               1.000000        1.000000            NaN  
25%               9.000000     1110.000000            NaN  
50%              16.000000     1443.000000            NaN  
75%              30.000000     1925.250000            NaN  
max             171.000000     2400.000000            NaN  

[11 rows x 77 columns]
In [5]:
pd.options.display.max_columns = None
display(df_jan.sample(5))
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE OP_UNIQUE_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY FIRST_DEP_TIME TOTAL_ADD_GTIME LONGEST_ADD_GTIME DIV_AIRPORT_LANDINGS DIV_REACHED_DEST DIV_ACTUAL_ELAPSED_TIME DIV_ARR_DELAY DIV_DISTANCE DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_WHEELS_ON DIV1_TOTAL_GTIME DIV1_LONGEST_GTIME DIV1_WHEELS_OFF DIV1_TAIL_NUM
511243 2019 1 1 6 7 2019-01-06 DL 19790 DL N918DE 2274 10994 1099402 30994 CHS Charleston, SC SC 45 South Carolina 37 10397 1039707 30397 ATL Atlanta, GA GA 13 Georgia 34 830 827.0 -3.0 0.0 0.0 -1.0 0800-0859 12.0 839.0 923.0 3.0 950 926.0 -24.0 0.0 0.0 -2.0 0900-0959 0 NaN 0 80.0 59.0 44.0 1 259 2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32264 2019 1 1 12 6 2019-01-12 AA 19805 AA N839NN 289 15304 1530402 33195 TPA Tampa, FL FL 12 Florida 33 14107 1410702 30466 PHX Phoenix, AZ AZ 4 Arizona 81 2001 1943.0 -18.0 0.0 0.0 -2.0 2000-2059 12.0 1955.0 2211.0 17.0 2250 2228.0 -22.0 0.0 0.0 -2.0 2200-2259 0 NaN 0 289.0 285.0 256.0 1 1788 8 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
193445 2019 1 1 25 5 2019-01-25 WN 19393 WN N415WN 2291 14683 1468305 33214 SAT San Antonio, TX TX 48 Texas 74 14679 1467903 33570 SAN San Diego, CA CA 6 California 91 925 932.0 7.0 7.0 0.0 0.0 0900-0959 13.0 945.0 1055.0 20.0 1035 1115.0 40.0 40.0 1.0 2.0 1000-1059 0 NaN 0 190.0 223.0 190.0 1 1129 5 6.0 0.0 33.0 0.0 1.0 NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
465079 2019 1 1 19 6 2019-01-19 NK 20416 NK N605NK 1068 14747 1474703 30559 SEA Seattle, WA WA 53 Washington 93 12889 1288903 32211 LAS Las Vegas, NV NV 32 Nevada 85 600 551.0 -9.0 0.0 0.0 -1.0 0600-0659 14.0 605.0 802.0 6.0 824 808.0 -16.0 0.0 0.0 -2.0 0800-0859 0 NaN 0 144.0 137.0 117.0 1 867 4 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
353574 2019 1 1 30 3 2019-01-30 UA 19977 UA N39418 1727 13204 1320402 31454 MCO Orlando, FL FL 12 Florida 33 11618 1161802 31703 EWR Newark, NJ NJ 34 New Jersey 21 600 553.0 -7.0 0.0 0.0 -1.0 0600-0659 11.0 604.0 801.0 9.0 834 810.0 -24.0 0.0 0.0 -2.0 0800-0859 0 NaN 0 154.0 137.0 117.0 1 937 4 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Let us only condsidered delayed flights which are not cancelled or diverted as they form the bulk of the data
Also Diverted and Cancelled flight most often will have Weather conditions, which could be beyond the control to remediate
In [6]:
df_jan[df_jan['DIV_REACHED_DEST'].notnull()].sample(5)
Out[6]:
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE OP_UNIQUE_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY FIRST_DEP_TIME TOTAL_ADD_GTIME LONGEST_ADD_GTIME DIV_AIRPORT_LANDINGS DIV_REACHED_DEST DIV_ACTUAL_ELAPSED_TIME DIV_ARR_DELAY DIV_DISTANCE DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_WHEELS_ON DIV1_TOTAL_GTIME DIV1_LONGEST_GTIME DIV1_WHEELS_OFF DIV1_TAIL_NUM
209160 2019 1 1 27 7 2019-01-27 OH 20397 OH N215PS 5163 14100 1410005 34100 PHL Philadelphia, PA PA 42 Pennsylvania 23 10361 1036105 30361 ART Watertown, NY NY 36 New York 22 1819 1817.0 -2.0 0.0 0.0 -1.0 1800-1859 17.0 1834.0 2139.0 8.0 1940 2147.0 NaN NaN NaN NaN 1900-1959 0 NaN 1 81.0 NaN NaN 1 287 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 210.0 127.0 0.0 SYR 15096.0 1509602.0 2025.0 53.0 53.0 2118.0 N215PS
402959 2019 1 1 20 7 2019-01-20 WN 19393 WN N260WN 4624 10821 1082106 30852 BWI Baltimore, MD MD 24 Maryland 35 14576 1457606 34576 ROC Rochester, NY NY 36 New York 22 1115 1111.0 -4.0 0.0 0.0 -1.0 1100-1159 12.0 1123.0 NaN NaN 1230 NaN NaN NaN NaN NaN 1200-1259 0 NaN 1 75.0 NaN NaN 1 277 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 0.0 NaN NaN 1033.0 MCO 13204.0 1320402.0 1334.0 7.0 7.0 NaN NaN
174204 2019 1 1 22 2 2019-01-22 WN 19393 WN N760SW 293 13204 1320402 31454 MCO Orlando, FL FL 12 Florida 33 13198 1319801 33198 MCI Kansas City, MO MO 29 Missouri 64 2120 2126.0 6.0 6.0 0.0 0.0 2100-2159 7.0 2133.0 46.0 12.0 2315 58.0 NaN NaN NaN NaN 2300-2359 0 NaN 1 175.0 NaN NaN 1 1072 5 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 272.0 103.0 0.0 TUL 15370.0 1537002.0 2315.0 22.0 19.0 14.0 N760SW
475110 2019 1 1 22 2 2019-01-22 OO 20304 OO N606SK 3585 13487 1348702 31650 MSP Minneapolis, MN MN 27 Minnesota 63 13232 1323202 30977 MDW Chicago, IL IL 17 Illinois 41 1251 1349.0 58.0 58.0 1.0 3.0 1200-1259 56.0 1445.0 956.0 7.0 1441 1003.0 NaN NaN NaN NaN 1400-1459 0 NaN 1 110.0 NaN NaN 1 349 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 1214.0 1162.0 0.0 ORD 13930.0 1393007.0 1614.0 127.0 84.0 935.0 N606SK
342570 2019 1 1 3 4 2019-01-03 YX 20452 YX N202JQ 5869 12478 1247805 31703 JFK New York, NY NY 36 New York 22 11298 1129806 30194 DFW Dallas/Fort Worth, TX TX 48 Texas 74 1820 1814.0 -6.0 0.0 0.0 -1.0 1800-1859 61.0 1915.0 2256.0 16.0 2153 2312.0 NaN NaN NaN NaN 2100-2159 0 NaN 1 273.0 NaN NaN 1 1391 6 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 358.0 79.0 0.0 BNA 10693.0 1069302.0 2038.0 32.0 20.0 2117.0 N202JQ
In [7]:
df_jan[df_jan['DIVERTED']!= 0].sample(10) 
Out[7]:
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE OP_UNIQUE_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY FIRST_DEP_TIME TOTAL_ADD_GTIME LONGEST_ADD_GTIME DIV_AIRPORT_LANDINGS DIV_REACHED_DEST DIV_ACTUAL_ELAPSED_TIME DIV_ARR_DELAY DIV_DISTANCE DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_WHEELS_ON DIV1_TOTAL_GTIME DIV1_LONGEST_GTIME DIV1_WHEELS_OFF DIV1_TAIL_NUM
357177 2019 1 1 28 1 2019-01-28 UA 19977 UA N77871 597 12889 1288903 32211 LAS Las Vegas, NV NV 32 Nevada 85 13930 1393007 30977 ORD Chicago, IL IL 17 Illinois 41 1057 1229.0 92.0 92.0 1.0 6.0 1000-1059 19.0 1248.0 947.0 9.0 1629 956.0 NaN NaN NaN NaN 1600-1659 0 NaN 1 212.0 NaN NaN 1 1514 7 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 1167.0 1047.0 0.0 IND 12339.0 1233904.0 1946.0 35.0 19.0 1008.0 N77871
502248 2019 1 1 13 7 2019-01-13 AS 19930 AS N285VA 1025 12478 1247805 31703 JFK New York, NY NY 36 New York 22 14771 1477104 32457 SFO San Francisco, CA CA 6 California 91 1130 1627.0 297.0 297.0 1.0 12.0 1100-1159 45.0 1712.0 2052.0 4.0 1510 2056.0 NaN NaN NaN NaN 1500-1559 0 NaN 1 400.0 NaN NaN 1 2586 11 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 449.0 346.0 0.0 SLC 14869.0 1486903.0 1930.0 16.0 11.0 2024.0 N285VA
311115 2019 1 1 25 5 2019-01-25 DL 19790 DL N359NW 2470 12953 1295304 31703 LGA New York, NY NY 36 New York 22 14027 1402702 34027 PBI West Palm Beach/Palm Beach, FL FL 12 Florida 33 805 804.0 -1.0 0.0 0.0 -1.0 0800-0859 66.0 910.0 1424.0 4.0 1120 1428.0 NaN NaN NaN NaN 1100-1159 0 NaN 1 195.0 NaN NaN 1 1035 5 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 384.0 188.0 0.0 ATL 10397.0 1039707.0 1132.0 50.0 44.0 1302.0 N359NW
271269 2019 1 1 6 7 2019-01-06 AS 19930 AS N638VA 1127 10821 1082106 30852 BWI Baltimore, MD MD 24 Maryland 35 14771 1477104 32457 SFO San Francisco, CA CA 6 California 91 2000 2210.0 130.0 130.0 1.0 8.0 2000-2059 17.0 2227.0 253.0 30.0 2322 323.0 NaN NaN NaN NaN 2300-2359 0 NaN 1 382.0 NaN NaN 1 2457 10 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 493.0 241.0 0.0 LAS 12889.0 1288903.0 48.0 16.0 10.0 140.0 N638VA
203471 2019 1 1 4 5 2019-01-04 OH 20397 OH N206PS 5450 12953 1295304 31703 LGA New York, NY NY 36 New York 22 11267 1126702 31267 DAY Dayton, OH OH 39 Ohio 44 1305 1305.0 0.0 0.0 0.0 0.0 1300-1359 15.0 1320.0 1912.0 4.0 1510 1916.0 NaN NaN NaN NaN 1500-1559 0 NaN 1 125.0 NaN NaN 1 549 3 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 371.0 246.0 0.0 LGA 12953.0 1295304.0 1336.0 58.0 48.0 1743.0 N206PS
503383 2019 1 1 15 2 2019-01-15 AS 19930 AS N613AS 538 11630 1163002 31517 FAI Fairbanks, AK AK 2 Alaska 1 14709 1470904 30073 SCC Deadhorse, AK AK 2 Alaska 1 1300 1601.0 181.0 181.0 1.0 12.0 1300-1359 21.0 1622.0 NaN NaN 1420 NaN NaN NaN NaN NaN 1400-1459 0 NaN 1 80.0 NaN NaN 1 373 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 0.0 NaN NaN 626.0 ANC 10299.0 1029906.0 1702.0 4.0 4.0 NaN NaN
463143 2019 1 1 22 2 2019-01-22 NK 20416 NK N683NK 892 13204 1320402 31454 MCO Orlando, FL FL 12 Florida 33 11433 1143302 31295 DTW Detroit, MI MI 26 Michigan 43 1859 1856.0 -3.0 0.0 0.0 -1.0 1800-1859 16.0 1912.0 1231.0 22.0 2143 1253.0 NaN NaN NaN NaN 2100-2159 0 NaN 1 164.0 NaN NaN 1 957 4 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 1077.0 910.0 0.0 CLE 11042.0 1104205.0 2147.0 66.0 46.0 1154.0 N683NK
155003 2019 1 1 3 4 2019-01-03 WN 19393 WN N8503A 166 12191 1219102 31453 HOU Houston, TX TX 48 Texas 74 13851 1385103 33851 OKC Oklahoma City, OK OK 40 Oklahoma 73 1105 1118.0 13.0 13.0 0.0 0.0 1100-1159 10.0 1128.0 NaN NaN 1230 NaN NaN NaN NaN NaN 1200-1259 0 NaN 1 85.0 NaN NaN 1 419 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 0.0 NaN NaN 986.0 LAS 12889.0 1288903.0 1228.0 3.0 3.0 NaN NaN
188618 2019 1 1 24 4 2019-01-24 WN 19393 WN N8716B 1710 13204 1320402 31454 MCO Orlando, FL FL 12 Florida 33 12191 1219102 31453 HOU Houston, TX TX 48 Texas 74 1245 1240.0 -5.0 0.0 0.0 -1.0 1200-1259 12.0 1252.0 1531.0 5.0 1420 1536.0 NaN NaN NaN NaN 1400-1459 0 NaN 1 155.0 NaN NaN 1 849 4 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 236.0 76.0 0.0 MSY 13495.0 1349505.0 1330.0 6.0 4.0 1437.0 N8716B
227732 2019 1 1 6 7 2019-01-06 DL 19790 DL N776DE 395 14869 1486903 34614 SLC Salt Lake City, UT UT 49 Utah 87 14570 1457002 34570 RNO Reno, NV NV 32 Nevada 85 2041 2054.0 13.0 13.0 0.0 0.0 2000-2059 25.0 2119.0 636.0 5.0 2122 641.0 NaN NaN NaN NaN 2100-2159 0 NaN 1 101.0 NaN NaN 1 422 2 NaN NaN NaN NaN NaN NaN NaN NaN 1 1.0 647.0 559.0 0.0 SLC 14869.0 1486903.0 2243.0 26.0 19.0 626.0 N776DE
In [8]:
df_feb = pd.read_csv('T_ONTIME_REPORTING_Feb_2019.csv',
                     engine='c')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3057: DtypeWarning: Columns (48,69,76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [9]:
df_mar = pd.read_csv('T_ONTIME_REPORTING_Mar_2019.csv',
                     engine='c')
In [10]:
df_apr = pd.read_csv('T_ONTIME_REPORTING_Apr_2019.csv',
                     engine='c')
In [11]:
df_may = pd.read_csv('T_ONTIME_REPORTING_May_2019.csv',
                     engine='c')
In [12]:
df_jun = pd.read_csv('T_ONTIME_REPORTING_Jun_2019.csv',
                     engine='c')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3057: DtypeWarning: Columns (42,70) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [13]:
# Concatinate all frames to 1 dataframe
frames = [df_jan, df_feb, df_mar, df_apr, df_may, df_jun]
df_H12019 = pd.concat(frames, sort=True)
In [14]:
df_H12019.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3634338 entries, 0 to 636690
Data columns (total 77 columns):
ACTUAL_ELAPSED_TIME        float64
AIR_TIME                   float64
ARR_DEL15                  float64
ARR_DELAY                  float64
ARR_DELAY_GROUP            float64
ARR_DELAY_NEW              float64
ARR_TIME                   float64
ARR_TIME_BLK               object
CANCELLATION_CODE          object
CANCELLED                  int64
CARRIER_DELAY              float64
CRS_ARR_TIME               int64
CRS_DEP_TIME               int64
CRS_ELAPSED_TIME           float64
DAY_OF_MONTH               int64
DAY_OF_WEEK                int64
DEP_DEL15                  float64
DEP_DELAY                  float64
DEP_DELAY_GROUP            float64
DEP_DELAY_NEW              float64
DEP_TIME                   float64
DEP_TIME_BLK               object
DEST                       object
DEST_AIRPORT_ID            int64
DEST_AIRPORT_SEQ_ID        int64
DEST_CITY_MARKET_ID        int64
DEST_CITY_NAME             object
DEST_STATE_ABR             object
DEST_STATE_FIPS            int64
DEST_STATE_NM              object
DEST_WAC                   int64
DISTANCE                   int64
DISTANCE_GROUP             int64
DIV1_AIRPORT               object
DIV1_AIRPORT_ID            float64
DIV1_AIRPORT_SEQ_ID        float64
DIV1_LONGEST_GTIME         float64
DIV1_TAIL_NUM              object
DIV1_TOTAL_GTIME           float64
DIV1_WHEELS_OFF            float64
DIV1_WHEELS_ON             float64
DIVERTED                   int64
DIV_ACTUAL_ELAPSED_TIME    float64
DIV_AIRPORT_LANDINGS       int64
DIV_ARR_DELAY              float64
DIV_DISTANCE               float64
DIV_REACHED_DEST           float64
FIRST_DEP_TIME             float64
FLIGHTS                    int64
FL_DATE                    object
LATE_AIRCRAFT_DELAY        float64
LONGEST_ADD_GTIME          float64
MONTH                      int64
NAS_DELAY                  float64
OP_CARRIER                 object
OP_CARRIER_AIRLINE_ID      int64
OP_CARRIER_FL_NUM          int64
OP_UNIQUE_CARRIER          object
ORIGIN                     object
ORIGIN_AIRPORT_ID          int64
ORIGIN_AIRPORT_SEQ_ID      int64
ORIGIN_CITY_MARKET_ID      int64
ORIGIN_CITY_NAME           object
ORIGIN_STATE_ABR           object
ORIGIN_STATE_FIPS          float64
ORIGIN_STATE_NM            object
ORIGIN_WAC                 float64
QUARTER                    int64
SECURITY_DELAY             float64
TAIL_NUM                   object
TAXI_IN                    float64
TAXI_OUT                   float64
TOTAL_ADD_GTIME            float64
WEATHER_DELAY              float64
WHEELS_OFF                 float64
WHEELS_ON                  float64
YEAR                       int64
dtypes: float64(37), int64(23), object(17)
memory usage: 2.1+ GB

Let us get the data for rest of the months

The structure of your dataset?

The dataset is has main sections of data with data fields related to the section

  • Time Period
    • Year, Quater, Month, DayoftheMonth, DayoftheWeek, FlighDate
  • Airline
    • Detailed Airlines company and Filght
  • Origin
    • Details of originating City and Airport
  • Destination
    • Details of Destination City and Airport
  • Departure Performance
    • Metrics on depature time and delay
  • Arrival Performance
    • Metrics on Arrival time and delay
  • Cancellation and Diversions
    • Indicators of cancelled and divereted flights
  • Flight Summary
    • Metrics on planned and Actual flight time, distances etc.
  • Cause of Delay
  • Gate return to origin Aiprport
  • Diverted flight details

Further to the data fields are dervided into bin certain field like

  • Delay Time block - The hour duration of the day like 1900-1959
  • Delay Time Group - Blocks of every 15 min , 0-15 is 0 and 0-30 is 1 and so on, to indicate the sevirity of the delay.
  • Distant Group - The blocks of 250 kms is indicated as distant group. first 250 kms is 1 and upto 500 kms is 2 and so on.
  • Delay in minutes due to each Catergories of delay is shown seperately to show all causes of delay

The main feature(s) of interest.

The focus of the analysis is to find out what are the primary cause of day, In term of Day of the week , time of the day , routes causing delays

The features in the dataset that support the investigation.

By general sense we could say weather could cause the most delay follow by NAS ( Natioanal Air Saftey ) technical reasons. We could see by further doing the analysis

Let us for now ignore the delays due to flight cancellations and diverstions as it quite obvious will cause delay. We will come back to it latter

In [15]:
# We dont want both diverted and cancelled flights
df_ndely = df_H12019.query('CANCELLED != 1 & DIVERTED != 1')
df_ndely.sample(5)
Out[15]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR
573664 141.0 126.0 0.0 13.0 0.0 13.0 913.0 0900-0959 NaN 0 NaN 900 530 150.0 5 3 1.0 22.0 1.0 22.0 552.0 0001-0559 MCO 13204 1320402 31454 Orlando, FL FL 12 Florida 33 990 4 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-06-05 NaN NaN 6 NaN WN 19393 1404 WN NaN 13232 1323202 30977 NaN NaN NaN NaN NaN 2 NaN N208WN 9.0 6.0 NaN NaN 558.0 904.0 2019
322966 61.0 33.0 0.0 1.0 0.0 1.0 1546.0 1500-1559 NaN 0 NaN 1545 1437 68.0 31 7 0.0 8.0 0.0 8.0 1445.0 1400-1459 ATL 10397 1039707 30397 Atlanta, GA GA 13 Georgia 34 143 1 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-03-31 NaN NaN 3 NaN OO 20304 4240 OO AGS 10208 1020803 30208 Augusta, GA GA 13.0 Georgia 34.0 1 NaN N459SW 10.0 18.0 NaN NaN 1503.0 1536.0 2019
356795 151.0 113.0 1.0 319.0 12.0 319.0 3.0 1800-1859 NaN 0 0.0 1844 1700 164.0 28 1 1.0 332.0 12.0 332.0 2232.0 1700-1759 ORD 13930 1393007 30977 Chicago, IL IL 17 Illinois 41 719 3 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN 1653.0 1 2019-01-28 0.0 145.0 1 319.0 UA 19977 1180 UA EWR 11618 1161802 31703 Newark, NJ NJ 34.0 New Jersey 21.0 1 0.0 N69824 6.0 32.0 145.0 0.0 2304.0 2357.0 2019
75459 81.0 60.0 1.0 1247.0 12.0 1247.0 1310.0 1600-1659 NaN 0 7.0 1623 1455 88.0 23 3 1.0 1254.0 12.0 1254.0 1149.0 1400-1459 DEN 11292 1129202 30325 Denver, CO CO 8 Colorado 82 406 2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-01-23 1240.0 NaN 1 0.0 UA 19977 1002 UA JAC 12441 1244102 32441 Jackson, WY WY 56.0 Wyoming 88.0 1 0.0 N833UA 10.0 11.0 NaN 0.0 1200.0 1300.0 2019
424640 122.0 90.0 0.0 4.0 0.0 4.0 1528.0 1500-1559 NaN 0 NaN 1524 1328 116.0 12 3 0.0 -2.0 -1.0 0.0 1326.0 1300-1359 ATL 10397 1039707 30397 Atlanta, GA GA 13 Georgia 34 594 3 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-06-12 NaN NaN 6 NaN DL 19790 1460 DL NaN 13303 1330303 32467 NaN NaN NaN NaN NaN 2 NaN N364NW 20.0 12.0 NaN NaN 1338.0 1508.0 2019

Univariate Exploration

As we are most interested in the delay timing let us look how much are delay is generally seen. Let us have closer look at the distribution of delays
In [16]:
fig, ax = plt.subplots(nrows=2, figsize=[15,10])
sb.distplot(df_ndely['DEP_DELAY_NEW'],
           ax=ax[0])
sb.distplot(df_ndely['ARR_DELAY_NEW'],
           ax=ax[1])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x23b96eea3c8>
In [17]:
fig, ax = plt.subplots(nrows=2, figsize=[8,8])

var = ['DEP_DELAY_GROUP', 'ARR_DELAY_GROUP']

for i in range(len(var)):
    ax[i].hist(data=df_ndely, x=var[i])
    ax[i].set_xlabel('{} in n * 15 min blocks'.format(var[i]))
    ax[i].set_xticks(np.arange(-2,14,1))    

plt.show()
While we could see most of the flights take of early or on-time but we are more interested in delays. Let us haver a closer look at delay more than 15 mins
In [18]:
fig, ax = plt.subplots(nrows=2, figsize=[8,8])

var = ['DEP_DELAY_GROUP', 'ARR_DELAY_GROUP']

for i in range(len(var)):
    ax[i].hist(data=df_ndely, x=var[i])
    ax[i].set_xlabel('{} in n * 15 min blocks'.format(var[i]))
    ax[i].set_xlim((1,13))
    ax[i].set_ylim((0,400000))
    ax[i].set_xticks(np.arange(1,14,1))

plt.show()
In [19]:
t_bins = np.arange(min(df_ndely[var[i]]), max(df_ndely[var[i]])+1.0, 1.0)
len(t_bins)
Out[19]:
15
In [194]:
fig, ax = plt.subplots(nrows=2, figsize=[15,8])

var = ['DEP_DELAY', 'ARR_DELAY']

for i in range(len(var)):
    ax[i].hist(data=df_ndely, x=var[i], bins = t_bins, 
               
              )
    ax[i].set_xlabel('{} in minutes'.format(var[i]))
    ax[i].set_xlim(-60, 250)
    ax[i].set_xticks(np.arange(-60,250,10))
#     ax[i].axvline(df_ndely[var[i]].value_counts().index[0],c='r')
plt.show()
Let us explore other factors like time of the day and day of the weeks is cause for the delay. Let us filter the delayed flights with material delay
In [21]:
df_dely = df_H12019.query('DEP_DELAY_NEW > 0 | ARR_DELAY_NEW > 0')
In [22]:
fig, ax = plt.subplots(nrows=1, figsize=[8,5])

var = 'DAY_OF_WEEK'

ax.hist(data=df_dely, x=var, 
        width = 0.5
       )            
ax.set_xlabel('Day of the Week'.format(var))
ax.set_xticklabels(['x', 'Sun','Mon','Tue','Wed','Thu','Fri','Sat']) #, [1,2,3,4,5,6,7])
ax.axvline(df_ndely[var].value_counts().index[0],c='r')
plt.show()
Interestingly most delays are on Wednesday and Thurdays followed by saturdays and sundays. Fridays have least dealys. Let us explore what days cause more than 15 minitues delays
In [23]:
df_dely = df_H12019.query('DEP_DEL15 > 0 | ARR_DEL15 > 0')
In [24]:
fig, ax = plt.subplots(nrows=1, figsize=[8,5])

var = 'DAY_OF_WEEK'

ax.hist(data=df_dely, x=var, 
        width = 0.5,
        align='mid'
       )            
ax.set_xlabel('Day of the Week'.format(var))
ax.set_xticklabels(['x', 'Sun','Mon','Tue','Wed','Thu','Fri','Sat']) #, [1,2,3,4,5,6,7])
ax.axvline(df_dely[var].value_counts().index[0],c='r')
plt.show()
Same tread is seen when delay is greater than 15 minitues. Fridays have least numbers of dealys.
Let us explore the time of the day block to find which tme of the day cause most delays
In [25]:
df_Ddely = df_H12019.query('DEP_DELAY_NEW > 0')
In [26]:
x = df_Ddely['DEP_TIME_BLK'].unique() #.sort()
x.sort()
In [27]:
sb.countplot(data=df_Ddely, x='DEP_TIME_BLK',
             color='g',
             order=x,
            );
plt.xticks(rotation=90);
We could observe that departure delays are almost evently spread across between 1000 hrs to 2100 hrs, Most number of delays are between evening 5 pm to 6 pm.
Let us have a closer look at the arrival delays
In [28]:
df_Adely = df_H12019.query('ARR_DELAY_NEW > 0')
In [29]:
x = df_Adely['ARR_TIME_BLK'].unique() #.sort()
x.sort()
In [30]:
sb.countplot(data=df_dely, x='ARR_TIME_BLK',
             color='g',
             order=x,
            );
plt.xticks(rotation=90);
We could observe that arrival delays distribution is different from departure distribution, more delay are see towards end of the day between 4 pm to 10 pm. The least delay observed in morning arrival flights . This shows there is difference in departure and arrival delay patterns
Let us have look the causes are arrival and departure delay minutes. First let us verify only departure delay more 15 mins at this causes by summing up total delay minutes by categories
In [31]:
df_ddely = df_H12019.query('ARR_DEL15 == 0 and DEP_DEL15 > 0')
In [32]:
df_adely = df_H12019.query('DEP_DEL15 == 0 and ARR_DEL15 > 0 ')
In [33]:
df_xdely = df_H12019.query('DEP_DEL15 > 0 and ARR_DEL15 > 0 ')
In [34]:
x = df_ddely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
x
Out[34]:
CARRIER_DELAY          0.0
WEATHER_DELAY          0.0
NAS_DELAY              0.0
SECURITY_DELAY         0.0
LATE_AIRCRAFT_DELAY    0.0
dtype: float64
In [35]:
y = df_adely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
y
Out[35]:
CARRIER_DELAY           321838.0
WEATHER_DELAY            44798.0
NAS_DELAY              3818470.0
SECURITY_DELAY            2877.0
LATE_AIRCRAFT_DELAY     169491.0
dtype: float64
In [36]:
z = df_xdely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
z
Out[36]:
CARRIER_DELAY          14723949.0
WEATHER_DELAY           2908009.0
NAS_DELAY               8142991.0
SECURITY_DELAY            75373.0
LATE_AIRCRAFT_DELAY    19762370.0
dtype: float64
In [37]:
x.plot(kind='bar', title='Total Departure delay with Arrival Delay < 15 minutes by Categories')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x23b9dce1978>
There are no delays iattributed when Arrival is with in 15 mins of scheduled time and depature delay of any catergory is less than 15 minutes, This is an interesting finding
In [38]:
y.plot(kind='bar', title='Total Arrival delay mins with < 15 min departure Delay minutes by Categories')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x23b98a13978>
Extremly high duration of delay is due to NAS delay which mean flights are waiting to takeoff and land. This could be due to high air traffic, air traffic control delay. This give us an indication if we should Arrival delays are primarly caused due to getting signal to land and taxi. We also need to check if any specific airport is causing the landing delays
In [39]:
z.plot(kind='bar', title='Total delay minutes by Categories when both departure delay and arrival delay > 15 mins')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x23b9ddc6780>
Late arrival of aircraft from earlier flight is the most significant cause of delay. This is called delay propagation in aviation terms. refer https://aspmhelp.faa.gov/index.php/Types_of_Delay.
The delay created by flight operator to prepare the aircraft for flight is next significant cause of delay. This indicates we have check which airline companies are routes causing the the most delay
As we have noted that arrival major cause of arrival delay is NAS delay which is attribuatable to airports, Let us move on to Bivariate plot.

The distributions of variable(s) of interest as discussed here.

Delay Time Group is left skewed similarly both Depture Delay time and Arrvial departure time is left skewed. We could observer that most of the flights departed 5 mins earlier and arrived 10 mins sooner. But there is long tail of arrival delay steadly declining to 120 mins. But there are outliners on arrival delays.
Most of the flights departs on time and interestingly 5 to 1 minitue earlier . There is gradual decline in the number of depatures from 0 mintures to 110 minitues and bulk of the departure delay minitues are in in this range.

Week of the Day Wednesdays and Thursdays found to have most delays followed by Saturdays and Sundays. Fridays have least delays

Time of the Day block Departure delays are almost evently spread across between 1000 hrs to 2100 hrs, Most number of delays are between evening 5 pm to 6 pm. elay are see towards end of the day between 4 pm to 10 pm. The least delay observed in morning arrival flights

Delay Categories There are no arrival delay beyound 15 mins delays are not attribute to any of the delay catergories. The NAS delay is extremly high in case of arrival delay.

Of the features investigated, there are some usual distributions.

The fligts which are cancelled and Diverted are ommited from analysis as are bound to cause huge delays and could be outliers. Only flight which reached their destinations are included in the analysis.

Both arrival and departure delay mintues are fairly normally distributed and does not require any transformations.

Bivariate Exploration

Let us check relationship between the variables of interst Arrival Delay, Departure delay, Arrival delay blocks, Week of the day, Time of the day, Origin , Destinations, routes.
In [40]:
plt.figure(figsize=(30,30));

vars=['ACTUAL_ELAPSED_TIME',
                'AIR_TIME',
                'ARR_DELAY',
#                 'ARR_DELAY_NEW',
#                 'ARR_TIME',
#                 'CRS_ARR_TIME',
#                 'CRS_DEP_TIME',
                'CRS_ELAPSED_TIME',
                'DAY_OF_WEEK',               
                'DEP_DELAY',
#                 'DEP_DELAY_NEW',
                'DEP_TIME',               
#                 'DEST',
#                 'DEST_CITY_MARKET_ID',
                'DISTANCE',
#                 'FIRST_DEP_TIME',
#                 'FLIGHTS',
                'MONTH',
#                 'OP_CARRIER',
#                 'OP_CARRIER_AIRLINE_ID',
#                 'ORIGIN',
#                 'ORIGIN_CITY_MARKET_ID',
                'QUARTER',                    
                'TAXI_IN',
                'TAXI_OUT',
                'WHEELS_OFF',
                'WHEELS_ON']

sb.pairplot(data=df_ndely.sample(1000), 
            vars=vars,
            plot_kws={'alpha':1/3}
            )
Out[40]:
<seaborn.axisgrid.PairGrid at 0x23b9de34f60>
<Figure size 2160x2160 with 0 Axes>
On closer examination of picture of the scatterplot there are no surprising relationships emerge, to get better views of the relationship let is plot a heatmap with correlation between variable. So we could see from the scatter plot, there are not quadratic or cubic relationships seen. Heat map should of coorelations should give fairly good picture
In [41]:
plt.figure(figsize = [18, 15]);
sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
           cmap = 'viridis', center = 0)
plt.show();
We could only see all natural co-relations. Acutal Elapsed Time vs Airtime, Acutal Elapsed Time vs CRS Elapsed Time , Acutal Elapsed Time vs Distance, Air-Time vs Distance, Air-Time vs CRS-elapsed-time, Arr-delay vs Departure-Delay, CRS Elapsed Time vs distance, Departure time vs Wheels off, Departure time vs Wheels on, Wheels on vs Wheels off,
Let us check pairwise relationships. Arrival delay minutes by airports to understand if there are any relationship
In [42]:
plt.figure(figsize = [18, 15]);

vars=[ 'ARR_DELAY_NEW',
                'ARR_TIME',
                'CRS_ARR_TIME',
                'CRS_DEP_TIME',
                'CRS_ELAPSED_TIME',
                'DEP_DELAY_NEW',             
                'DEST',
                'DEST_CITY_MARKET_ID',
                'FIRST_DEP_TIME',
                'MONTH',
                'OP_CARRIER',
                'OP_CARRIER_AIRLINE_ID',
                'ORIGIN',
                'ORIGIN_CITY_MARKET_ID']


sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
           cmap = 'viridis', center = 0)
plt.show();
Let us understand the relation between delay caterories and delays
In [196]:
plt.figure(figsize = [13, 10]);

vars=[ 'ARR_DELAY_NEW',
       'DEP_DELAY_NEW',             
       'LATE_AIRCRAFT_DELAY',
       'NAS_DELAY',
       'CARRIER_DELAY',
       'WEATHER_DELAY',
       'SECURITY_DELAY',
      ]


sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
           cmap = 'viridis', center = 0)
plt.show();
Based on the ealier anlysis we understand the could be relation between catergory of delay and Arrival or Departure delay
The above heat map should, Depature and Arrival delays are highly related. Carrier delay and Late Aircraft delays and Over all delays are fairly related.
Let us check pairwise relationships. Arrival delay minutes by airports to understand if there are any relationship
In [43]:
a = df_H12019.groupby(by='DEST')['ARR_DELAY'].sum().sort_values(ascending=False)
In [44]:
plt.figure(figsize=(15,7))
sb.barplot(x=a.index, y=a.values, order=a.index)
plt.xticks(rotation=90);
Let zoom in further to understand airport which caued more the 100k minutes and more delay H12019
In [45]:
b = a[a > 100000]
plt.figure(figsize=(15,7))
sb.barplot(x=b.index, y=b.values, order=b.index)
plt.xticks(rotation=90);
plt.title('US Airports causing more the 100k minutes or more arrival delay in H12019');
plt.ylabel('Total Minitues');
plt.xlabel('Destination Airport');
We could understand that signficant delay is caused at ORD and DFW airports ( this are Chicago and Dallas Aiport)
Let is check mean and quratile delay for these airports
In [46]:
c = df_H12019[df_H12019['DEST'].isin(b.index.values)]
In [47]:
d = b.index.values
d
Out[47]:
array(['ORD', 'DFW', 'EWR', 'LGA', 'SFO', 'DEN', 'BOS', 'IAH', 'CLT',
       'LAX', 'MCO', 'LAS', 'PHX', 'DCA', 'JFK', 'FLL', 'ATL', 'DTW',
       'MSP', 'PHL', 'IAD', 'RDU', 'BNA', 'TPA', 'STL', 'MIA', 'SAN',
       'CLE', 'CVG', 'SLC', 'AUS', 'MSY', 'IND', 'MCI', 'SEA', 'MDW',
       'PIT', 'CMH', 'SAT', 'DAL', 'OKC', 'SMF', 'SJU', 'HOU'],
      dtype=object)
In [48]:
plt.figure(figsize=(15,12))
sb.stripplot(data=c, x='DEST', y='ARR_DELAY_NEW', 
             order=d,
            )
plt.yticks(np.arange(0,c['ARR_DELAY_NEW'].max(),30));
plt.xticks(rotation=90);
plt.title('General distribution of delay minutes for US Airports causing total delay over 100k minutes in H1 2019');
General distribution of delay minutes for US Airports causing total delay over 10000 minutes in H1 2019
In [49]:
colour = sb.color_palette()[4]
plt.figure(figsize=(15,10))
sb.violinplot(data=c, x='DEST', y='ARR_DELAY_NEW', order=d)
plt.yticks(np.arange(0,c['ARR_DELAY_NEW'].max(),30));
plt.xticks(rotation=90);
We could observe that the mean and quartiles lies with in 60 miniutes time but there are many out-liers making the total delay minutes extremly high. Let us find out the distribution of arrival delays with in hour
In [50]:
df_60 = c[c['ARR_DELAY_NEW'] <= 60 ]
In [51]:
plt.figure(figsize=(15,8))
sb.pointplot(data=df_60, 
           x='DEST',
           y='ARR_DELAY_NEW',
           order=d,
           estimator=np.mean,
           ci='sd',
#            notch=True,
#            fliersize=1/3,
          )
plt.yticks(np.arange(0,df_60['ARR_DELAY_NEW'].max(),5));
plt.xticks(rotation=90);
plt.title('Arrival Delay in US airport within 60 minutes');
The mean delay minutes are 5 mintues and standard deviation is around 15 minutes which mean with in 60 minutes. Let is do the same analysis for more then 60 minutes delay
In [52]:
df_GT60 = c[c['ARR_DELAY_NEW'] > 60 ]
In [53]:
plt.figure(figsize=(15,8))
sb.pointplot(data=df_GT60, 
           x='DEST',
           y='ARR_DELAY_NEW',
           order=d,
           estimator=np.mean,
           ci='sd',
#            notch=True,
#            fliersize=1/3,
          )
plt.yticks(np.arange(0,df_GT60['ARR_DELAY_NEW'].max(),60));
plt.xticks(rotation=90);
plt.title('Mean Arrival Delay in US airport greater than 60 minutes');
When the delays are more than 60 minutes the mean delay minutes 180 mintues and standard deviation is around 150 minutes.
Let us consider the Depatured delay by Airline as we earlier observer major cause of departure delay are due to Carrier and Late Arrival Delays
In [54]:
i = df_H12019.groupby(by='OP_UNIQUE_CARRIER')['DEP_DELAY'].sum().sort_values(ascending=False)
In [55]:
plt.figure(figsize=(15,7))
sb.barplot(x=i.index, y=i.values, order=i.index)
plt.xticks(rotation=90);
plt.title('Total Delay by Carrier');
Southwest Airlines (WN) has cause the most delays. Hawaiian airlines (HA) has best track record of lowest cumulative delay minutes. Let us now understand delay by trips which give much closer look of performace of airline
In [56]:
j = df_H12019.groupby(by='OP_UNIQUE_CARRIER')['DEP_DELAY'].mean().sort_values(ascending=False)
In [57]:
plt.figure(figsize=(15,7))
sb.barplot(x=j.index, y=j.values, order=j.index)
plt.xticks(rotation=90);
plt.title('Mean Delay by Carrier');
Eva Airline (EV) and Jetblue (B6) are worst peforming in terms of average delay times still Hawillian Airlines stands out.
Let us consider if any routes have worst timing performace.
In [58]:
k = df_H12019
k['ROUTE'] = k['ORIGIN']+'-'+k['DEST']
k1 = k.groupby(by='ROUTE')['DEP_DELAY_NEW'].sum().sort_values(ascending=False)
k2 = k.groupby(by='ROUTE')['ARR_DELAY_NEW'].sum().sort_values(ascending=False)
k3 = k.groupby(by='ROUTE')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
k4 = k.groupby(by='ROUTE')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
Let us pick routes with total delay of 50000 minutes and more
In [59]:
k1 = k1[k1>50000]
k2 = k2[k2>50000]
In [60]:
fig, ax = plt.subplots(nrows=2, figsize=[15,10])

sb.barplot(x=k1.index, 
           y=k1.values,
             color='g',
             order=k1.index,
             ax=ax[0]
            );
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Total Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Total Departure Delay over 500k minutes on Air routes ')

sb.barplot(x=k2.index,
           y=k2.values,
           color='g',
           order=k2.index,
           ax=ax[1]
            );
plt.xlabel('Air Routes') 
ax[1].set_ylabel('Total Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
ax[1].set_title('Total Arrival Delay over 500k minutes on Air routes ')



plt.show();
We could find the ORD to LGD and reverse direction are most delayed flights, followed by SFO to LAX and reverse direction. This could be because of large number of flights between these airports. Mean minutes could give more clearer picture.
Let us pick routes with mean delay of 30 minutes and more
In [61]:
k3 = k3[k3 > 30.0]
k4 = k4[k4 > 30.0]
In [62]:
fig, ax = plt.subplots(nrows=2, figsize=[15,10])

sb.barplot(x=k3.index, 
           y=k3.values,
             color='g',
             order=k3.index,
             ax=ax[0]
            );
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Mean Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)

sb.barplot(x=k4.index,
           y=k4.values,
           color='g',
           order=k4.index,
           ax=ax[1]
            );
plt.xlabel('Air Routes') 
ax[1].set_ylabel('Mean Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)


plt.show();
Let us zoom it further to mean delay of 60 minutes to understand routes with regular delay
In [63]:
k5 = k3[k3 > 60.0]
k6 = k4[k4 > 60.0]
In [64]:
fig, ax = plt.subplots(nrows=2, figsize=[15,10])

sb.barplot(x=k5.index, 
           y=k5.values,
             color='g',
             order=k5.index,
             ax=ax[0]
            );
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Mean Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)

sb.barplot(x=k6.index,
           y=k6.values,
           color='g',
           order=k6.index,
           ax=ax[1]
            );
plt.xlabel('Air Routes') 
ax[1].set_ylabel('Mean Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)


plt.show();
Mean delay mintues shows a different picture. When there a high mean shows consistant delays VSP to SRQ route has huge delays for both arrival and Departure minutes. Followed by AZA-MEM route. We could plot to all the flight to and fron these airport to understand if this is the case.
Fort Wilton beach to Sarasota-Bradenton international is cause of most delays. Both these airports are in Florida on south cost, Both these airports are not majoy airports and major routes. AZA - Phonix Mesa Gateway airport to MEM - Memphis airport are showing next most average delays
Let us analyis delay by Flight numbers and flight tail Numbers
In [65]:
df_flnum = df_H12019
df_flnum['FL_NUM'] = df_flnum['OP_UNIQUE_CARRIER'] + df_flnum['OP_CARRIER_FL_NUM'].astype(str)
In [66]:
flnum1 = df_flnum.groupby(by='FL_NUM')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
flnum2 = df_flnum.groupby(by='FL_NUM')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
In [67]:
fig, ax = plt.subplots(nrows=2, figsize=[20,12])

sb.barplot(x=flnum1[:50].index, 
           y=flnum1[:50].values,
           color='g',
           order=flnum1[:50].index,
           ax=ax[0]
        );
ax[0].set_xlabel('Flight Numbers')
ax[0].set_ylabel('mean Departure Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Top 50 Mean dealy minutes')

sb.barplot(x=flnum2[:50].index,
           y=flnum2[:50].values,
           color='g',
           order=flnum2[:50].index,
           ax=ax[1]
            );
plt.xlabel('Flight Numbers') 
ax[1].set_ylabel('Mean Arrival Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)


plt.show();
Notoriously fight number G45814 is causing maximum delay. But it seems to be private jet so we could omit this. OO4711 is next greatest delay causing flight
Let us check the delay by Aircraft Tail Number that is any specific aircraft causing delays
In [68]:
tlnum1 = df_H12019.groupby(by='TAIL_NUM')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
tlnum2 = df_H12019.groupby(by='TAIL_NUM')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
In [69]:
fig, ax = plt.subplots(nrows=2, figsize=[20,12])

sb.barplot(x=tlnum1[:50].index, 
           y=tlnum1[:50].values,
           color='g',
           order=tlnum1[:50].index,
           ax=ax[0]
        );
ax[0].set_xlabel('Flight Tail Numbers')
ax[0].set_ylabel('mean Departure Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Top 50 Mean dealy minutes')

sb.barplot(x=tlnum2[:50].index,
           y=tlnum2[:50].values,
           color='g',
           order=tlnum2[:50].index,
           ax=ax[1]
            );
plt.xlabel('Flight Tail Numbers') 
ax[1].set_ylabel('Mean Arrival Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)


plt.show();
Only 4 flights have mean delays more than 100 minutes. N684RW', 'N656YX', 'N2341U', 'N728AN We could further role in delays in combination with other factors

The relationships observed in this part of the investigation. How the feature(s) of interest vary with other features in the dataset is explained?

Airport vs Arrival Delay On average 5 minutes delay for airports with more than 100k mnutes total delay and delay is less that 1 hours. When the delay is more than 1 hour the average delay is 180 minutes.

Airlines vs Departure delay EVA airlines, and Jetblue airlines are worse performing airlines in term of average delay .They are at 17.5 minutes delay on average.

Airport Routes - Total delays is shown in Chiago to Newyork and LA to SFO routes because of relatively large number of flighys, The large mean delay is seen in minor airports and Minor routes in Florida and Phenix

Flight Number - There are few flight numbers especially flight OO4711 is greatest delay causing flights other than other private jets

Tail Numbers - There are specific Air crafts causing delays are N684RW, N656YX, N2341U, N728AN

Interesting relationships between the other features (not the main feature(s) of interest) are observed.

Few AIRPORT, Routes, Aircrafts, Flight number are causing most delays. There are also some days and time of the day cusing delays. This will be investigated further in Multivariate analysis.

Multivariate Exploration

So for we have see that , Week of the day, Time of the day, Specific routes, Airports and even specific aircrafts are causing delay.
Departure delay have different reason from Arrival Delay. Now we will consider combination these factor which cause delays
For each of these catergories let us find our Major cause of delay
First Delay by Week days , time of the day vs Cause of delay. We have observed most delays in SUN, WED THU SAT and time of the day between 1500 to 2159 for Depature dealy and 1600 to 2359 for Arrival Delay
In [148]:
# Select only most delay causing days SUN, WED, THUs , SAT
df_w = df_H12019[df_H12019['DAY_OF_WEEK'].isin([1,4,5,7])]

# Select flight where the is 15 minutes or more arrival or departure delays
df_w = df_w.query('DEP_DELAY_GROUP > 0.0 | ARR_DELAY_GROUP > 0.0')

# Select more than flights with more than 15 mins arrival and departure delay.
df_wd = df_w.query('DEP_DEL15 > 0 and ARR_DEL15 > 0 ')

#select flights with departed in the hour blocks when most delays are seen ( 1500 - 2159)
df_wd = df_wd[df_wd['DEP_TIME_BLK'].isin(['1500-1559','1600-1659','1700-1759','1800-1859','1900-1959','2000-2059','2100-2159'])]

#Select flight which has no departure delays but more than 15 mins arrival delays
df_wa = df_w.query('DEP_DEL15 == 0 and ARR_DEL15 > 0')

#select flights in time block where most arrival delays are observed 1600 - 2359
df_wa = df_wa[df_wa['ARR_TIME_BLK'].isin(['1600-1659','1700-1759','1800-1859','1900-1959','2000-2059','2100-2159','2200-2259','2300-2359'])]

#
del df_w
In [149]:
# create faceted heat maps Day of the week Time of the day and cause for delay for departures
g = sb.FacetGrid(data = df_wd, 
                 col = 'DEP_TIME_BLK',
                 row = 'DAY_OF_WEEK',
                 size = 3,
                 margin_titles = True,
                )

plt.xticks(rotation='vertical')

g.map(plt.bar,
      x=df_wd[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).index,
      height=df_wd[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).values,
      color = 'g', )

g.set_xlabels('Delay Categories')
g.set_ylabels('Mean Delay Minutes')

plt.show()
We could observer most of the delay are caused by late Aircraft catergory . Further drill down to understand which of the Airlines causing most of delays
In [150]:
# Get only records which has Late Aircraft Delay catergory 
df_wd1 = df_wd[df_wd['LATE_AIRCRAFT_DELAY'] > 0.0 ]
df_wd1.sample()
Out[150]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR ROUTE FL_NUM
419926 103.0 86.0 1.0 19.0 1.0 19.0 1934.0 1900-1959 NaN 0 0.0 1915 1835 100.0 28 4 1.0 16.0 1.0 16.0 1851.0 1800-1859 LAS 12889 1288903 32211 Las Vegas, NV NV 32 Nevada 85 519 3 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-03-28 16.0 NaN 3 3.0 WN 19393 909 WN BOI 10713 1071302 30713 Boise, ID ID 16.0 Idaho 83.0 1 0.0 N762SW 5.0 12.0 NaN 0.0 1903.0 1929.0 2019 BOI-LAS WN909
In [151]:
df_wd1.groupby(by='OP_UNIQUE_CARRIER')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index
Out[151]:
Index(['YV', 'UA', 'NK', '9E', 'F9', 'OH', 'B6', 'EV', 'OO', 'AS', 'YX', 'AA',
       'G4', 'DL', 'MQ', 'WN', 'HA'],
      dtype='object', name='OP_UNIQUE_CARRIER')
In [152]:
sb.catplot(data = df_wd1,
           x='OP_UNIQUE_CARRIER',
           y='LATE_AIRCRAFT_DELAY', 
           hue='DAY_OF_WEEK',
           kind='bar',
           height=10,
           aspect=2,
           palette = 'Blues',
           estimator=np.mean,
           order=df_wd1.groupby(by='OP_UNIQUE_CARRIER')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index,
           )

plt.suptitle('Airlines Causing most departure delays due Late Aircraft arrival by Week day')
plt.xlabel('Airline Code');
plt.ylabel('Mean Delay in Minutes');
plt.show();
Clearly we can see YU- EuroAtalintic is least puntual followed by United Airlines are worst performers in punctuality. We could observer average delay of 80 minutes for these airline during Saturdays.
Let us further drill down the cause due to any specfic aircraft causing delays
In [153]:
df_wd2 = df_wd1[df_wd1['OP_UNIQUE_CARRIER'].isin(['YV','UA'])].sort_values(by='LATE_AIRCRAFT_DELAY',ascending=False)[:50]
df_wd2.sample()
Out[153]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR ROUTE FL_NUM
157940 92.0 76.0 1.0 660.0 12.0 660.0 1018.0 2300-2359 NaN 0 0.0 2318 2040 98.0 27 4 1.0 666.0 12.0 666.0 746.0 2000-2059 DSM 11423 1142307 31423 Des Moines, IA IA 19 Iowa 61 589 3 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-06-27 660.0 NaN 6 0.0 UA 19977 1597 UA NaN 11292 1129202 30325 NaN NaN NaN NaN NaN 2 0.0 N73275 6.0 10.0 NaN 0.0 756.0 1012.0 2019 NaN UA1597
In [154]:
df_wd2.groupby(by='TAIL_NUM')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index[:50]
Out[154]:
Index(['N852UA', 'N66837', 'N89304', 'N88330', 'N777UA', 'N521LR', 'N80343',
       'N902FJ', 'N221UA', 'N86322', 'N896UA', 'N37263', 'N37273', 'N89308',
       'N425UA', 'N89357', 'N841UA', 'N13227', 'N37437', 'N73251', 'N14102',
       'N588UA', 'N85354', 'N75851', 'N475UA', 'N28457', 'N88326', 'N87318',
       'N76514', 'N89313', 'N908FJ', 'N73275', 'N89315', 'N45440', 'N24202',
       'N809UA', 'N435UA', 'N489UA', 'N78285', 'N33132', 'N402UA', 'N496UA',
       'N26123', 'N76529', 'N77296', 'N33289', 'N467UA', 'N953LR'],
      dtype='object', name='TAIL_NUM')
In [155]:
sb.catplot(data = df_wd2,
           x='TAIL_NUM',
           y='LATE_AIRCRAFT_DELAY', 
           hue='OP_UNIQUE_CARRIER',
           kind='bar',
           height=10,
           aspect=2,
           palette = 'Greens',
           order=df_wd2.groupby(by='TAIL_NUM')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index,
           )

plt.suptitle('Top 50 Aircrafts Causing most departure delays due Late Aircraft arrival by Airline Company YU and UA',
            fontsize=20.0)
plt.xticks(rotation=90)
plt.xlabel('Aircraft Tail Number')
plt.ylabel('Mean Late Arrival Minutes')
plt.show();
We could clearly understand that top 50 flights of top 2 delay causing airline is around 600 minutes. 5 Airlines from UA and 3 from YU are causing more than 1000 minutes delays. We could further dril down to understand their routes of those aircraft to understand the distribution of Late Aircraft delay and Departure delay
In [156]:
plt.figure(figsize=(10,8))
sb.scatterplot(data=df_wd2, 
           x='LATE_AIRCRAFT_DELAY',
           y='DEP_DELAY',
           hue='OP_CARRIER_FL_NUM',
           palette='viridis_r',
           style='OP_UNIQUE_CARRIER'
#               x_jitter=0.3,
#               y_jitter=0.3,
#               alpha=0.3)
              )
plt.legend(title='Flight Number')
plt.suptitle('Distribution of Departure delays and Late Aircraft arrival delay')
plt.xlabel('Late Aircraft Delay in Minutes')
plt.ylabel('Departure Delay in Minutes')
Out[156]:
Text(0, 0.5, 'Departure Delay in Minutes')
We could understand YU flights departure delay is comensurate to the Late arrival delay . But we could see UA airlines has further added to already late arrived flights.
We could clealy say UA is worst performing in flights UA2500 Houston,TX to Settle.WA & UA5000 Newyork to Clevland .
YV flight ASH 7500 is the next worst performing next to UA fligths.
Ler us further continue the Analysis of Arrial Delays. We will for find the major reason for Arrival delays
In [157]:
# create faceted heat maps Day of the week Time of the day and cause for delay for departures
g = sb.FacetGrid(data = df_wa, 
                 col = 'ARR_TIME_BLK',
                 row = 'DAY_OF_WEEK',
                 size = 3,
                 margin_titles = True,
                )

plt.xticks(rotation='vertical')

g.map(plt.bar,
      x=df_wa[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).index,
      height=df_wa[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).values,
      color = 'g', )

g.set_xlabels('Delay Categories')
g.set_ylabels('Mean Arrival Delay Minutes')

plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\axisgrid.py:230: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Above plots shows that single most prominent cause of Late Arrivals is NAS delays attributable air traffic congition at airport. We can further inverstiage any particuar airport is more congested at peak arrival time of the day
In [158]:
# Get only records which has NAS Delay catergory 
df_wa1 = df_wa[df_wa['NAS_DELAY'] > 0.0 ]
df_wa1.sample()
Out[158]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR ROUTE FL_NUM
273530 390.0 354.0 1.0 19.0 1.0 19.0 2244.0 2200-2259 NaN 0 0.0 2225 1915 370.0 10 7 0.0 -1.0 -1.0 0.0 1914.0 1900-1959 SFO 14771 1477104 32457 San Francisco, CA CA 6 California 91 2457 10 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-03-10 0.0 NaN 3 19.0 AS 19930 1127 AS BWI 10821 1082106 30852 Baltimore, MD MD 24.0 Maryland 35.0 1 0.0 N623VA 25.0 11.0 NaN 0.0 1925.0 2219.0 2019 BWI-SFO AS1127
In [169]:
# Group by destiation Aiport to find the mean NAS Delay by airport
df_wa2 = df_wa1.groupby(by='DEST')['NAS_DELAY'].mean().sort_values(ascending=False)
df_wa2.size
Out[169]:
325
In [173]:
plt.figure(figsize=(15,10));
sb.barplot(df_wa2.index[:50], df_wa2.values[:50], color = 'g')
plt.xticks(rotation=90);
plt.suptitle('Mean NAS Delay Minutes of Aiports with Top 50 NAS Delay Minutes', fontsize=10);
plt.xlabel('Destination Airport');
plt.ylabel('Minutes');

plt.show();
We could see IAG - Niagra falls international Airport has extremly high delay due to NAS delay. IMT, ALO, BFM, ATY all have above 40 mins average NAS delay. Let us concentrate of these Airports and check what time of the day has most arrival delays
Let us see the NAS delay vs Arrival Delay for top 5 airports
In [189]:
df_wa4 = df_H12019.query("DEST in ['IAG','IMT','ALO','BFM','ATY']").query("NAS_DELAY > 0.0")
df_wa4.sample()
Out[189]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR ROUTE FL_NUM
48514 182.0 158.0 1.0 49.0 3.0 49.0 1900.0 1800-1859 NaN 0 23.0 1811 1424 167.0 29 3 1.0 34.0 2.0 34.0 1458.0 1400-1459 BFM 10562 1056202 30562 Mobile, AL AL 1 Alabama 51 1132 5 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-05-29 11.0 NaN 5 15.0 F9 20436 234 F9 DEN 11292 1129202 30325 Denver, CO CO 8.0 Colorado 82.0 2 0.0 N336FR 9.0 15.0 NaN 0.0 1513.0 1851.0 2019 DEN-BFM F9234
In [190]:
sb.catplot(data=df_wa4,
          x='DEST',
          y='ARR_DELAY_NEW',
          hue='DAY_OF_WEEK',
          kind='bar',
          height=10,
          aspect=2,
          palette = 'Blues',
          estimator=np.mean,
          order=df_wa3.groupby(by='DEST')['ARR_DELAY_NEW'].mean().sort_values(ascending=False).index,
           )

plt.suptitle('Top 5 Airports Causing most Arrival delays due to NAS delay by Week day',
            fontsize=20)
plt.xlabel('Airport Code', fontsize=20);
plt.ylabel('Arrival Delay in Minutes', fontsize=20);
plt.show();
We could observe that IAG and ALO airports have evenly distributed Arrival Delays. which shows , these airports are operating beyond its capacity to handle currently see high traffic. IMT has spick on Saturday and Wednesdays, These day could have high number of flights landing,
In [179]:
# Select only top Destination with 40 mins or more Arrival delays
df_wa3 = df_wa1.query("DEST in ['IAG','IMT','ALO','BFM','ATY']")
df_wa3.sample()
Out[179]:
ACTUAL_ELAPSED_TIME AIR_TIME ARR_DEL15 ARR_DELAY ARR_DELAY_GROUP ARR_DELAY_NEW ARR_TIME ARR_TIME_BLK CANCELLATION_CODE CANCELLED CARRIER_DELAY CRS_ARR_TIME CRS_DEP_TIME CRS_ELAPSED_TIME DAY_OF_MONTH DAY_OF_WEEK DEP_DEL15 DEP_DELAY DEP_DELAY_GROUP DEP_DELAY_NEW DEP_TIME DEP_TIME_BLK DEST DEST_AIRPORT_ID DEST_AIRPORT_SEQ_ID DEST_CITY_MARKET_ID DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC DISTANCE DISTANCE_GROUP DIV1_AIRPORT DIV1_AIRPORT_ID DIV1_AIRPORT_SEQ_ID DIV1_LONGEST_GTIME DIV1_TAIL_NUM DIV1_TOTAL_GTIME DIV1_WHEELS_OFF DIV1_WHEELS_ON DIVERTED DIV_ACTUAL_ELAPSED_TIME DIV_AIRPORT_LANDINGS DIV_ARR_DELAY DIV_DISTANCE DIV_REACHED_DEST FIRST_DEP_TIME FLIGHTS FL_DATE LATE_AIRCRAFT_DELAY LONGEST_ADD_GTIME MONTH NAS_DELAY OP_CARRIER OP_CARRIER_AIRLINE_ID OP_CARRIER_FL_NUM OP_UNIQUE_CARRIER ORIGIN ORIGIN_AIRPORT_ID ORIGIN_AIRPORT_SEQ_ID ORIGIN_CITY_MARKET_ID ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC QUARTER SECURITY_DELAY TAIL_NUM TAXI_IN TAXI_OUT TOTAL_ADD_GTIME WEATHER_DELAY WHEELS_OFF WHEELS_ON YEAR ROUTE FL_NUM
66079 252.0 160.0 1.0 92.0 6.0 92.0 2045.0 1900-1959 NaN 0 0.0 1913 1633 160.0 9 7 0.0 0.0 0.0 0.0 1633.0 1600-1659 IAG 12265 1226505 32265 Niagara Falls, NY NY 36 New York 22 1069 5 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 NaN NaN NaN NaN 1 2019-06-09 0.0 NaN 6 92.0 G4 20368 942 G4 NaN 14112 1411206 33195 NaN NaN NaN NaN NaN 2 0.0 309NV 4.0 88.0 NaN 0.0 1801.0 2041.0 2019 NaN G4942
In [191]:
sb.catplot(data=df_wa3,
          x='DEST',
          y='NAS_DELAY',
          hue='DAY_OF_WEEK',
          kind='bar',
          height=10,
          aspect=2,
          palette = 'Blues',
          estimator=np.mean,
          order=df_wa3.groupby(by='DEST')['NAS_DELAY'].mean().sort_values(ascending=False).index,
           )

plt.suptitle('Top 5 Airports Causing most Arrival delays due to NAS delay for SUN, WED, THU, SAT',
            fontsize=20)
plt.xlabel('Airport Code', fontsize=20);
plt.ylabel('Mean Delay in Minutes', fontsize=20);
plt.show();

The relationships observed in this part of the investigatio .There were features that strengthened each other in terms of looking at the feature(s) of interest.

We have see Departure delays are primarly caused by Late Aircraft Arrival delay. This is attributable to the Airlines.

We could further see that YU and UA are worst performing Airline for departure delays.

We narrowed down to 3 flight numbers which have consistant delayed arrival. One of the routes is Houston to Settel and other is Newyork to Clevland

The Arrival delays are more to do with NAS delays which are due to Airports ATS congession.

IAG ,IMT, ALO, BFM, ATY are 5 most arrival delay causing airports.

Futher breaking down by Day of the week IAG- Niagra falls Airport is congested only on Saturdays. IMT ford airports only on Thurdays. ALO Waterloo airport on Sundays, Wednesday, Thursday and Sundays, but mostly on Wednesdays. BMF airport on Sundays . ATY airport on Saturdays.

There should be reason for Air traffic congestion on these specific day for specific airport. These have to be further investigated to understand the root cause.

IAG and ALO airports are looks to have fairly conjested air traffic, IMT is worst performing with Arrival delays peaking at 350 minutes on saturdays, This will certainly need closer examination.

There were few interesting or surprising interactions between features.

The Saturdays have most delayed arrival aircrafts and Sundays have next most delayed arrivals

Arrival delays are due to NAS delays and Departure delays are due to late arrival of Aircraft.

Flight number , Aircrafts cause most departure delays. Some destination airports cause extreamly high arrival delays

Carrier delays are the second most prominent cause of departure delays